ABSTRACT


The explosion of Open Educational Resources (OERs) in
recent years creates the demand for scalable, automatic
approaches to process and evaluate OERs, with the end goal 
of identifying and recommending the most suitable educa- 
tional materials for learners. We focus on building models 
to find the characteristics and features involved in context- 
agnostic engagement (i.e. population-based), a seldom re- 
searched topic compared to other contextualised and per- 
sonalised approaches that focus more on individual learner 
engagement. Learner engagement is arguably a more reliable
measure than popularity/number of views, is more
abundant than user ratings and has also been shown to be 
a crucial component in achieving learning outcomes. In this 
work, we explore the idea of building a predictive model 
for population-based engagement in education. We intro- 
duce a novel, large dataset of video lectures for predicting 
context-agnostic engagement and propose both cross-modal 
and modality specific feature sets to achieve this task. We 
further test different strategies for quantifying learner en- 
gagement signals. We demonstrate the use of our approach 
in the case of data scarcity. Additionally, we perform a sen- 
sitivity analysis of the best performing model, which shows 
promising performance and can be easily integrated into an 
educational recommender system for OERs.
1. INTRODUCTION


With the recent popularity of online learning platforms, the 
creation of Open Educational Resources (OERs) is increasing
rapidly [16]. This recent large-scale creation of educational
material demands ways to automatically manage
educational resources. In the context of OERs, this means 
finding and recommending material that fits the learners’ 
goals while maximising learning outcomes. Such a goal usu- 
ally entails a large personalisation factor. We define it as 
contextualised engagement, which captures how engaging a
learning resource is with regard to the context of the learner 
(e.g., learning needs/goals and learner state). Although
contextualised engagement has gained interest in recent
years [8], we argue that there is also a context-agnostic
engagement factor that only relates to features of the learning
resource and attempts to capture the gold-standard label 
of population-based engagement (i.e. the marginal of con- 
textual engagement for a resource across the population of 
learners). Modelling context-agnostic engagement enables 
identifying highly engaging resources across a population of 
learners before personalising educational recommendations 
to individuals. This paper studies the features involved in 
context-agnostic engagement, as a first step towards building
an integrative educational recommendation system that
will join both contextualised and context-agnostic features
[9].


A high quality learning resource needs to satisfy three main 
properties: i) academic soundness and appropriate cover- 
age of the body of knowledge, ii) pedagogical robustness 
and iii) enabling learners to achieve their desired learning 
outcomes [24]. Learner engagement has been shown to be 
a proxy for (iii), as engaging with material is a prerequi- 
site for learning. There is evidence from both online [33, 
23] and classroom [30, 36] educational settings showing that 
higher learner engagement increases the likelihood of bet- 
ter learning outcomes. We thus focus on finding the general 
characteristics of engaging material. Using features that can 
be extracted across multiple modalities (video, text, audio 
etc.) allows developing prediction models for gold-standard 
engagement that are easily adaptable to a wide range of 
OERs and can be automated [27]. 


Our work is one of the first to address educational engagement
prediction with video lectures, especially from a quantitative
perspective. One of our primary goals is to understand
if easily automatable cross-modal features can be used
as predictors for how engaging an educational resource is, as
opposed to modality specific features. Although large-scale
studies (involving millions of videos) have been conducted 
to analyse the prediction of engagement for general purpose 
videos [40], the largest study in the context of educational 
video lectures involves 800 videos from 4 courses and anal- 
yses engagement from a qualitative perspective [17]. To the 
best of our knowledge, this work is the first attempt to pre- 
dict engagement with educational videos automatically. Our 
experiments involve more than 4000 video lectures that span 
over 20 diverse subjects, making it the largest dataset to date 
in this field. Our dataset, code and best performing model 
are released with the paper. 


Given the usefulness of predicting context-agnostic engage- 
ment and the scarcity of work in this topic, we are motivated 
to answer the following research questions, which will enable 
the deployment of such a model in an educational platform: 


RQ1 How to encode context-agnostic engagement? 


RQ2 How effective are cross-modal language-based features 
for predicting engagement with video lectures? 


RQ3 Does including modality-specific features lead to a sig- 
nificant improvement in performance? 


RQ4 What features influence context-agnostic engagement? 


RQ5 Is predicting marginal population-based engagement 
useful over personalised engagement? 


RQ6 Can we assume a common underlying model for pre- 
dicting engagement across different knowledge areas? 


2. RELATED WORK 


The interest in identifying useful and engaging information 
goes beyond the educational domain and is investigated in 
numerous other fields [10]. For example, Wikipedia uses a 
review system to evaluate the quality of its articles. To do 
so, different machine learning models, such as support vec- 
tor regression and ensemble methods, are used with features 
such as text style, readability, structure, network, recency 
and review information [14, 39]. Moreover, in the context 
of automatic essay scoring, promising results have been ob- 
tained through rank preference support vector machines [41] 
and more sophisticated deep learning models [37]. 


Quality-based document ranking [3] and spam web-page detection
[28] are other areas in the information retrieval domain
that also utilise textual features and recency related
features. These features can be categorised into different verticals
such as understandability, topic coverage, presentation, freshness
and authority [10].


OERs available to the public come at large scale and in various
modalities [27, 19], which makes modality-specific models
of limited use. As existing work proposes models with
domain/modality specific features (e.g. network features of 
Wikipedia [15] or speaker speed in videos [17]), there is a 
need for models that can evaluate how engaging educational 
materials are at scale using a cross-modal feature set. We 
attempt to address this gap through this work. 


2.1 Why Modelling Engagement? 

As argued by Lane [24], a well designed learning resource 
should enable the learner to achieve the expected learning 
outcomes. Prior work has studied learner engagement in 
Massive Open Online Courses and shown that, when optimised,
engagement can increase the likelihood of
achieving better learning outcomes [33, 23]. User en- 
gagement has also been shown to differ greatly from popu- 
larity measures such as number of views [40], as the latter 
does not necessarily capture whether learners consume the 
material. In our work, we also show that engagement does 
not positively correlate with user ratings. Instead, what we 
observe is that lectures with low ratings also present low
engagement rates. However, lectures with higher ratings can
have different engagement rates. 


For videos, watch time has been used as the main mea- 
sure for quantifying engagement in the literature, e.g., for 
YouTube recommendations [13] and for predicting engagement with
videos [40]. For educational content, the median of nor- 
malised engagement time (i.e., the percentage of watch time 
from the total video) has been used as gold standard for 
engagement [17]. Our work tests several approaches to en- 
coding user engagement. 


Most of the related work regarding predicting educational 
engagement attempts to model learner engagement as a func- 
tion of the learner’s context (demography, user activity, etc.) 
[4, 19, 2], as opposed to modelling context-agnostic learner 
engagement as a function of content-based features of the 
educational resource, which is our aim. Context-agnostic en- 
gagement has been previously studied for video lectures, ad- 
vocating for qualitative and general recommendations such 
as keeping videos short [17], using conversational language 
for lecture delivery [5] and others. These recommendations 
empower authors to create better educational videos. How- 
ever, none of these works address the need for automatically 
identifying the features of highly engaging educational re- 
sources, which is imperative for retrieving and recommend- 
ing educational material at scale. 


3. DATA AND METHODOLOGY 


This section first describes the dataset built for predicting 
engagement, together with the set of features proposed in 
this paper. Then, we introduce the machine learning meth- 
ods and the feature importance analysis method considered. 


To address the research questions outlined in the introduc- 
tory section of this paper we do the following: i) We study 
different ways of refining user engagement signals, linking to 
literature on psychometrics (RQ1). ii) We propose two sets 
of easily automatable features for predicting engagement 
(cross-modal features inspired by context-agnostic quality 
literature and video-specific features) and evaluate the dif- 
ference of predictive performance between them (RQ2 and 
RQ3). iii) We construct a large dataset of video lectures 
and evaluate the performance of the proposed engagement 
signals and sets of features (RQ2-4). iv) We compare cross- 
modal to modality specific features, analysing the impact 
of individual features in the predictive model that presents 
the most promising performance (RQ4). v) We compare 
our population-based engagement approach to its personalised
analogue to demonstrate its usefulness (RQ5). vi) We
compare the engagement models obtained from dividing the
video lectures into two differentiated knowledge areas: STEM
(such as technology, physics and mathematics lectures) vs
others (such as arts, social science and philosophy lectures) (RQ6).


Proceedings of The 13th International Conference on Educational Data Mining (EDM 2020)


3.1 Dataset and Features (RQ2-4) 


We use data from a popular OER repository, VideoLectures.Net
(VLN)¹, a collection of videos of researchers presenting
at peer-reviewed conferences. This data is suitable
for our aim for two reasons: i) it contains watch patterns
about how learners consume lectures, and ii) the lectures are
peer-reviewed and hence the material is controlled for correctness
of knowledge and pedagogical robustness. The transcriptions
of English lectures and English translations for
the non-English lectures are provided by the TransLectures
project². We restrict the final dataset to lectures that have
been viewed by at least 5 unique users, leading to a final
dataset of 4,063 lectures. These lectures are categorised
into 21 subjects, e.g. Computer Science, Physics,
Philosophy, etc. The learner engagement labels of the dataset are
computed using 155,850 user view log events (video viewing
events) created between December 8, 2016 and February 17,
2018. The dataset constructed is publicly available, including
different statistics of population engagement and all the
cross-modal and video-based features proposed.


3.1.1 Cross-modal Features 

We selected a subset of cross-modal and mostly language- 
based features that are easy to extract from the VLN dataset. 
The 13 extracted features are shown in Table 1. This set has 
been selected based on recurring features in the related work 
[3, 14, 17, 28, 39] and their quality verticals [10] identified 
in our prior work. The majority of features were extracted 
using methods and token (word) sets that are found in the 
prior work referenced in Table 1. 


Additionally, we introduce the published date, represented by 
converting the video publication date to UNIX epoch time 
(in days). In other words, it is the number of days between 
January 01, 1970 and the lecture published date. 
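

As a minimal sketch of the published date feature described above, the conversion can be written with the Python standard library. The function name `published_date_feature` is hypothetical and is not taken from the released code.

```python
from datetime import date

# Reference point of UNIX epoch time.
UNIX_EPOCH = date(1970, 1, 1)


def published_date_feature(published: date) -> int:
    """Return the number of whole days between January 01, 1970 and `published`."""
    return (published - UNIX_EPOCH).days


# Example: a lecture published on the first day covered by the view logs.
feature_value = published_date_feature(date(2016, 12, 8))
print(feature_value)
```

Larger values correspond to more recent lectures, so the feature acts as a simple recency signal.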


3.1.2 Video-based Features 

We also extracted four out of the seven features proposed 
for analysing educational engagement with video lectures 
from [17], selecting those features that can be automated
and are objective. These are: i) lecture duration, as shorter 
videos have been shown to be much more engaging; ii) is 
chunked, whether the lecture has been partitioned into mul- 
tiple parts; iii) a set of indicator variables describing the type 
of lecture, such as tutorial, workshop, etc; and iv) speaker 
speed, measured by the average number of words spoken per
minute. We also include the silence period rate (SPR), cal- 
culated using the special tags in the video transcripts that 
indicate silence. Formally, for a lecture $\ell$, this feature $\mathrm{SPR}(\ell)$
is calculated as follows:


$$\mathrm{SPR}(\ell) = \frac{1}{D(\ell)} \sum_{t \in T(\ell)} D(t) \cdot \mathbb{I}\big(N(t) = \text{"silence"}\big), \qquad (1)$$


where $t$ is a tag in the collection of tags $T(\ell)$ that belong to
lecture $\ell$, $N$ returns the type of tag $t$, $D$ returns the duration
of tag $t$ or lecture $\ell$ and $\mathbb{I}(\cdot)$ is the indicator function
(returning 1 when the condition is verified, 0 otherwise).


Table 1: Extracted features from the VLN dataset.


Feature                          Reference
Content-based features
  Easiness (FK Easiness)         [14]
  Stop-word Presence Rate        [28]
  Stop-word Coverage Rate        [28]
  Document Entropy               [3]
  Word Count                     [39]
  Title Word Count               [3]
  Preposition Rate               [14]
  Auxiliary Rate                 [14]
  To Be Rate                     [14]
  Conjunction Rate               [14]
  Normalization Rate             [14]
  Pronoun Rate                   [14]
  Published Date                 -
Video-based features
  Lecture Duration               [17]
  Is Chunked                     [17]
  Video Lecture Type             [17]
  Speaker speed                  [17]
  Silence Period Rate (SPR)      -


¹ www.videolectures.net
² www.translectures.eu
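

The silence period rate of Eq. (1) can be illustrated with a short sketch. The tag representation used here, a list of (type, duration) pairs, is an assumption made for illustration; the actual transcript format of the TransLectures data is not described in this paper.

```python
from typing import List, Tuple

# Hypothetical tag layout: (tag type, duration in seconds).
Tag = Tuple[str, float]


def silence_period_rate(tags: List[Tag], lecture_duration: float) -> float:
    """Fraction of the lecture duration covered by tags of type "silence"."""
    if lecture_duration <= 0:
        raise ValueError("lecture_duration must be positive")
    # Sum D(t) over the tags whose type N(t) equals "silence".
    silent_time = sum(duration for kind, duration in tags if kind == "silence")
    return silent_time / lecture_duration


# Toy lecture of 100 seconds with two silent stretches of 5 seconds each.
toy_tags = [("speech", 50.0), ("silence", 5.0), ("speech", 40.0), ("silence", 5.0)]
spr = silence_period_rate(toy_tags, lecture_duration=100.0)
print(spr)
```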


3.2 Quantifying Engagement (RQ1) 

Our work focuses on implicit user feedback (most specifi- 
cally, engagement). Implicit feedback (in the form of num- 
ber of views, engagement or any other measure that does not 
require the user to provide explicit feedback) has been used 
for building recommender systems for nearly two decades 
with great success [29, 20, 22], as an alternative to explicit 
ratings, which have a high cognitive load on users and thus 
are usually sparse. However, implicit signals have other chal- 
lenges associated with them. For example, implicit feedback 
is usually positive-only [20] and can contain effects such as 
popularity bias, i.e., there might be a bias towards more pop- 
ular items, whereas implicit feedback for other items may be 
very sparse. There have been several works investigating the
relationship between explicit and implicit feedback [12, 34, 
42], which we also do through this work. 


The main measure that we use to quantify engagement is 
the Median of Normalised Engagement/watch Time
(MNET), as it has been proposed as the gold standard 
for engagement with educational materials in previous work 
[17]. To keep the MNET label in the range [0,1], we set
the upper bound of MNET to 1. We observed in our initial
data analysis that MNET values in the VLN dataset
follow a Log-Normal distribution: most users abandon the
lecture after a low time threshold. We hypothesise this may be because it
takes some time to decide whether the content is relevant
for the learner. Users that make it past this threshold seem
more committed and thus the leaving rate is significantly
lower. As such a skewed distribution is usually a problem when
using machine learning methods, we applied a log transformation
to the engagement signal. The final
label, Log Median Normalised Engagement Time (LMNET),
is computed using the following:


$$\mathrm{LMNET}(\ell) = \ln\big(\min(\mathrm{MNET}(\ell), 1)\big). \qquad (2)$$
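

The computation of the engagement label can be sketched as follows, under two assumptions that are not stated in the text: the per-view watch times and the lecture duration are given in the same unit, and the upper bound of 1 is applied to the median rather than to each view. The function names are hypothetical.

```python
import math
from statistics import median
from typing import Sequence


def mnet(watch_times: Sequence[float], lecture_duration: float) -> float:
    """Median of normalised engagement time, with the upper bound set to 1."""
    normalised = [w / lecture_duration for w in watch_times]
    return min(median(normalised), 1.0)


def lmnet(watch_times: Sequence[float], lecture_duration: float) -> float:
    """Natural log of MNET. Undefined when MNET is 0, so that case raises."""
    value = mnet(watch_times, lecture_duration)
    if value <= 0:
        raise ValueError("LMNET is undefined for an MNET of 0")
    return math.log(value)


# Toy lecture of 600 seconds watched by four users.
toy_watch_times = [30.0, 60.0, 600.0, 1500.0]
toy_mnet = mnet(toy_watch_times, 600.0)
toy_lmnet = lmnet(toy_watch_times, 600.0)
print(toy_mnet, toy_lmnet)
```

Since MNET lies in (0, 1] after capping, LMNET is never positive, and a fully watched lecture maps to 0.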


To test if LMNET can be further improved, we compare 
this approach of encoding engagement to other alternative 
ways of quantifying and cleaning engagement signals, draw- 
ing inspiration from the literature on psychometrics and sub- 
jective assessment [21, 38], which focuses on explicit human 
feedback and assumes that users present cognitive biases and 
differences, with applications in preference ranking and mea- 
suring perception-based qualities, such as engagement. The 
intuition behind this is that different learners may have a 
different engagement threshold and scale, as with
explicit ratings [21]. We compare different approaches for 
defining engagement: 


1. Raw LMNET, as per Eq. (2) which considers that no 
user differences exist and the marginal over the popu- 
lation can be directly used as gold standard label for 
engagement, as in [17].


2. Cleaned LMNET, for which we test the removal of 
bot-like users (those users with an average engagement 
rate less than 5%), which may have a detrimental effect
on the median of raw engagement.


3. Standardised LMNET, in which we preprocess LM- 
NET per user (subtracting the mean of the user and 
dividing by the standard deviation), as commonly done 
with human ratings in order to remove user biases and 
differences [21]. In this scale, positive values indicate 
lectures that are more engaging than the mean of the 
user and vice versa. 


4. Comparative MNET, in which we exploit the law 
of comparative judgement and use psychometric scal- 
ing to go from user comparative engagement data to 
a probabilistically interpretable engagement scale [38, 
32]. More specifically, we assume that engagement 
data can only be compared per user (as users may have 
different biases, thresholds or engagement scales). To 
do so, we generated a matrix of engagement comparisons
(of the type: Did learner i prefer lecture A to B in
terms of engagement?), which is used as the input for 
psychometric scaling, producing a final scale in which 
distances can be interpreted in terms of probability of 
greater engagement. 
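

The preprocessing behind approaches 2 to 4 can be sketched on toy data. The record layout (user, lecture, per-view engagement value) and all function names are assumptions made for illustration. The final psychometric scaling step of approach 4 is not implemented here; only its input, the per-user comparison counts, is built.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean, pstdev

# Hypothetical toy records: (user id, lecture id, engagement value of that view).
views = [
    ("u1", "A", 0.80), ("u1", "B", 0.20), ("u1", "C", 0.50),
    ("u2", "A", 0.60), ("u2", "B", 0.30),
    ("bot", "A", 0.01), ("bot", "B", 0.02),
]


def by_user(records):
    """Group records as {user: {lecture: value}}."""
    grouped = defaultdict(dict)
    for user, lecture, value in records:
        grouped[user][lecture] = value
    return grouped


def remove_bot_like(records, threshold=0.05):
    """Approach 2: drop users whose average engagement is below the threshold."""
    grouped = by_user(records)
    keep = {u for u, lectures in grouped.items() if mean(lectures.values()) >= threshold}
    return [r for r in records if r[0] in keep]


def standardise_per_user(records):
    """Approach 3: subtract the user mean and divide by the user standard deviation."""
    out = []
    for user, lectures in by_user(records).items():
        mu = mean(lectures.values())
        sigma = pstdev(lectures.values())
        for lecture, value in lectures.items():
            z = (value - mu) / sigma if sigma > 0 else 0.0
            out.append((user, lecture, z))
    return out


def comparison_counts(records):
    """Approach 4 input: counts[(a, b)] = number of users more engaged by a than by b."""
    counts = defaultdict(int)
    for lectures in by_user(records).values():
        for a, b in combinations(sorted(lectures), 2):
            if lectures[a] > lectures[b]:
                counts[(a, b)] += 1
            elif lectures[b] > lectures[a]:
                counts[(b, a)] += 1
    return dict(counts)


cleaned = remove_bot_like(views)
standardised = standardise_per_user(cleaned)
counts = comparison_counts(cleaned)
print(len(cleaned), counts)
```

Comparisons are only made within a user, which reflects the assumption that engagement values from different users are not directly comparable.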


As discussed, the limitation of these approaches is that they 
disregard the context of the learner and the temporal com- 
ponent that may inherently be present when engaging with 
educational material. A different measure to encode engage- 
ment is found in Wu et al. [40], where the main idea is to 
compare engagement relative to the length of the video. The 
authors propose this for entertainment videos. However, we 
argue against this approach in the case of educational ma- 
terial, as the aim is to take the learner to the desired state 
in the most efficient way, hence the general recommendation
found in the literature to keep videos as short as possible
[17].


3.3 Machine Learning Models (RQ2)


To learn to rank video lectures based on engagement, we 
evaluate the performance using pointwise ranking models. 
Regression algorithms predict the target variable in real 
value space ($y \in \mathbb{R}$), which allows them to create a global
ranking of observations based on predictions. We also eval- 
uate the performance of engagement prediction using ker- 
nelised models. Kernelisation allows capturing non-linear 
patterns in data without having to operate in the respective 
basis. Although it is more computationally efficient than 
working in the non-linear space itself, it is more computa- 
tionally expensive than solving the non-kernelised problem. 
Our choice of kernel for the models is the Radial Basis Function
(RBF). The RBF kernel is widely used in the literature and
has mathematical connections to other popular kernels such 
as exponential and polynomial kernels [11, 35]. 


We use two regression algorithms, namely, Ridge Regression 
(RR) and Support Vector Regression (SVR) in primal form. 
We use RR as it is a widely used algorithm for regression 
[40] and SVR as it has performed well in a similar task in 
prior work [14]. We also evaluate the performance of the 
kernelised version of the same two algorithms (with RBF 
kernel), Kernelised Ridge Regression (KRR) and Kernelised 
Support Vector Regression (KSVR). This allows us to under- 
stand if there is non-linearity in the patterns that benefits 
the prediction task. In all four models discussed above, we 
employ standard scaling as these models are not scale invariant.
L2 regularisation is used to defend against overfitting
and multicollinearity [26]. As ensemble techniques have been
shown to perform well in prior work [39], we also employ a
Random Forest Regressor (RF) to evaluate its prediction ca- 
pabilities. This model is also capable of capturing non-linear 
patterns. 


3.3.1 Comparison to Personalised Models (RQ5) 
One of our aims is to compare the population-based model 
to its personalised counterpart. The idea in this case is to 
test if a common baseline can be assumed for all users. For 
this, we train the same machine learning models per user, 
using the features previously proposed. 


3.4 Feature Importance Analysis (RQ4) 
Understanding how different features influence engageability 
of materials is vital in the educational domain, as learners will
be guided on life-changing pathways based on these judge- 
ments. In a conventional linear model such as RR or SVM, 
feature importance analysis is straightforward as the weight 
coefficients reflect the influence of features. 


In this paper we use SHapley Additive exPlanations (SHAP), 
which is a model-agnostic framework that quantifies the im- 
pact of features on the model predictions. It reliably esti- 
mates feature importance of complex model families such as 
ensembles [25]. A SHAP value is computed for every fea- 
ture of every prediction. Given a prediction and a feature, 
SHAP is computed by averaging how the prediction changes 
when the feature is present versus when it is absent. This procedure
enables quantifying the contribution of each feature to the 
model prediction. By plotting all the SHAP values of the 
prediction data points in a SHAP summary plot, we can 
identify how each feature influences the prediction. By calculating
the Mean Absolute SHAP (MAS) for each feature
$f$ over the observations:


$$\mathrm{MAS}_f = \frac{1}{N} \sum_{n=1}^{N} \left| \mathrm{SHAP}_{f,n} \right|, \qquad (3)$$


we obtain a more quantitative understanding of feature influence.
$N$ is the number of observations.
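

Eq. (3) can be illustrated with a few lines of Python. The sketch assumes the SHAP values have already been computed (for instance with the SHAP package) and are available as one row per observation and one column per feature. The feature names and values below are hypothetical toy data.

```python
from typing import Dict, List


def mean_absolute_shap(shap_values: List[List[float]], feature_names: List[str]) -> Dict[str, float]:
    """Average the absolute SHAP value of each feature over all observations."""
    n_observations = len(shap_values)
    if n_observations == 0:
        raise ValueError("at least one observation is required")
    return {
        name: sum(abs(row[j]) for row in shap_values) / n_observations
        for j, name in enumerate(feature_names)
    }


# Three observations (rows) and two features (columns).
toy_shap = [
    [0.2, -0.1],
    [-0.4, 0.1],
    [0.3, 0.0],
]
mas = mean_absolute_shap(toy_shap, ["word_count", "lecture_duration"])
print(mas)
```

Taking absolute values means MAS measures the size of a feature's influence and discards its direction, which is why it complements the SHAP summary plot.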


4. EXPERIMENTS AND DISCUSSION 


This section shows the experimental setup and results for 
the different experiments conducted. 


4.1 Experimental Setup 

The evaluation of the machine learning models is performed 
using a 5-fold cross-validation for both feature sets. The 
performance of different machine learning models with dif- 
ferent engagement quantification approaches can be found 
in Table 2. The performance when video-specific features 
are added is found in Table 3. 


After gaining an understanding of model performance (see 
results in Table 2), we employ the best performing method 
and encoding for the rest of the analyses, using a hold-out 
validation with a train-test split of 70:30 to save computa- 
tion. That is, the model is trained on the 70% training set 
and interpreted using the 30% test set. The experiments 
were implemented using the Scikit-learn [31], textatistic
[18] and SHAP [25] Python packages. The source code in
Python and the dataset are publicly available³.


4.1.1 Evaluation metrics 

Pairwise accuracy (Pair.) and Spearman Rank Order Cor- 
relation Coefficient (SROCC) are the ranking metrics we 
used to evaluate the ranking performance of machine learn- 
ing models with different engagement signal encodings. 


Identifying models that can rank between video lectures is 
the core objective of this work. Hence, we devise pairwise 
accuracy as the main evaluation metric. Pairwise accuracy 
is more intuitive for this task as it represents the fraction 
of pairwise comparisons where the model could predict the 
more engaging lecture. Another opportunity that pairwise 
comparison provides is the ability to restrict the comparisons 
to subsets of lecture pairs (e.g. lectures that belong to the 
same subject, lectures that have similar LMNET). 


In some of our experiments we also perform misranking anal- 
ysis and report the pairwise accuracy. Misranking could 
happen if a subset of examples is systematically difficult to 
rank. We hypothesise that misranking happens more
frequently as the difference of LMNET between a pair of 
video lectures gets smaller. That is, the model may strug- 
gle to differentiate between two lectures with similar en- 
gagement. By doing this analysis, we can also understand 
the sensitivity of the prediction model to similarly engaging 
lectures. Obviously, misranking a pair of lectures that are 
significantly different in engagement incurs a larger cost in 
terms of user satisfaction than misranking a pair of lectures 
with similar engagement. 
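

The two ranking metrics can be sketched in plain Python as below. This is an illustrative implementation, not the released code: pairs tied in the ground truth are skipped, which is an assumption, and the optional `groups` argument is a hypothetical way of restricting comparisons to lecture pairs from the same subset (e.g. the same subject).

```python
from itertools import combinations


def pairwise_accuracy(y_true, y_pred, groups=None):
    """Fraction of lecture pairs for which the more engaging lecture is predicted correctly."""
    correct = total = 0
    for i, j in combinations(range(len(y_true)), 2):
        if groups is not None and groups[i] != groups[j]:
            continue  # only compare lectures within the same group
        if y_true[i] == y_true[j]:
            continue  # a tie in the ground truth gives no ordering to predict
        total += 1
        if (y_true[i] - y_true[j]) * (y_pred[i] - y_pred[j]) > 0:
            correct += 1
    return correct / total if total else float("nan")


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5  # fails if either input is constant


# Toy LMNET labels and predictions for four lectures.
y_true = [-0.1, -0.5, -1.2, -2.0]
y_pred = [-0.2, -0.9, -0.6, -1.8]
subjects = ["cs", "cs", "physics", "physics"]

overall = pairwise_accuracy(y_true, y_pred)
within_subject = pairwise_accuracy(y_true, y_pred, groups=subjects)
rho = spearman(y_true, y_pred)
print(overall, within_subject, rho)
```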


³ https://github.com/sahanbull/context-agnostic-engagement


4.1.2 Controlling for Topics in Content 

The topics covered in the content of the lecture are likely to
drive learner engagement. For instance, Data Science lec- 
tures can be more popular than Physics lectures leading to 
easy pairwise comparison predictions between the domains. 
To test this, we restrict in some experiments the pairwise 
accuracy calculation to pairs of lectures that belong to the 
same domain (subject-specific column in Table 3) and ob- 
serve if the accuracy value changes significantly compared 
to its counterpart metric that considers all lecture pairs in 
a domain-agnostic fashion. 


4.2 Results 


This section presents a series of experiments to: 


E1 Analyse the relationship between engagement, number 
of views and mean star ratings (RQ1). 


E2 Test different machine learning models and engage- 
ment signals for the cross-modal features (RQ1-2). 


E3 Study the distribution of engagement with respect to 
length of materials (RQ4). 


E4 Study the influence of modality-specific features and 
comparison across subject areas (RQ3). 


E5 Analyse the importance of different features in the 
model (RQ4). 


E6 Compare the population-based model to its person- 
alised counterpart (RQ5). 


E7 Test if the same underlying model can be assumed for 
different knowledge/subject areas (RQ6). 


4.2.1 E1: Engagement vs Views and Ratings

The VLN data source also has mean star ratings (explicit 
feedback) for a subset of the considered lectures. It is note- 
worthy that we only have access to mean star ratings, not 
to the individual ratings per observer or the number of mea- 
surements. As done in previous work, we also analyse the 
relationship between implicit signals (engagement and num- 
ber of views) and explicit ratings. This can be found in 
Figure 1, where we show mean star rating vs MNET and 
number of views. The SROCC is close to zero, mainly be- 
cause of the large number of lectures with high rating but low 
engagement and number of views. We test the correlation 
for the 4 different versions of engagement considered (raw, 
cleaned, standardised and comparative), but all achieve sim- 
ilar results, with SROCC close to zero. One conclusion that 
is clear from the plot in Figure 1 is that number of views, 
ratings and engagement do represent very different informa- 
tion. For example, it can be appreciated that the variance of 
MNET and number of views increases with higher ratings, 
showing heteroskedasticity. This indicates that for low qual- 
ity resources (with low ratings) engagement is generally low, 
whereas for resources with higher ratings engagement differs 
and may be either high or low. This suggests that factors other
than simply the quality perceived by learners are involved in
engagement. Regarding the number of views, the correlation
seems rather negative, showing that the materials with
the highest number of views present very low engagement.


Figure 1: Scatter plots showing the relationship between (i) number of views vs. MNET, (ii) mean star
rating for the video lecture vs. MNET and (iii) mean star rating vs. number of views, together with the 
Spearman’s rank correlation coefficient (SROCC). 


Table 2: Pairwise accuracy (Pair.) and Spearman’s Rank Correlation Coefficient (SROCC) of engagement
prediction models with standard error from 5-fold cross validation and cross-modal features.


Model        | RR                    | SVR                   | KRR                   | KSVR                  | RF
Engagement   | Pair.     | SROCC     | Pair.     | SROCC     | Pair.     | SROCC     | Pair.     | SROCC     | Pair.     | SROCC
Raw          | .705±.011 | .581±.027 | .707±.000 | .586±.000 | .715±.004 | .607±.011 | .714±.007 | .604±.019 | .723±.009 | .625±.027
Cleaned      | .636±.033 | .396±.093 | .634±.031 | .392±.089 | .646±.025 | .424±.071 | .642±.028 | .414±.078 | .646±.031 | .427±.087
Standardised | .603±.035 | .302±.098 | .600±.035 | .292±.100 | .609±.035 | .315±.099 | .602±.025 | .297±.071 | .611±.035 | .323±.099
Comparative  | .624±.010 | .365±.028 | .624±.012 | .363±.036 | .626±.013 | .370±.040 | .627±.009 | .373±.027 | .636±.012 | .397±.038


4.2.2 E2: Encoding and Predicting Engagement 
Inherently, the task of finding a better engagement signal 
is very challenging, given the lack of ground truth. In this 
paper, we first attempt to see if any of these signals present 
better correlation with star ratings. However, we observe 
from Figure 1 that engagement is not strongly correlated 
with perceived quality by users (explicit star ratings) and 
similar results emerge for different methods of quantifying 
engagement, meaning there is no conclusive evidence that transforming
raw engagement signals strengthens their relationship to explicit
perceived quality. Thus, in order to decide on which
is the best way of capturing and quantifying engagement, 
we compare the pairwise accuracy for the four proposed ap- 
proaches (raw LMNET, cleaned, standardised and compar- 
ative). This simply tells us which output target variable 
is easier to predict given the proposed features. Table 2 
presents these results, together with the pairwise accuracy 
(Pair.) and Spearman’s Rank Order Correlation Coefficient 
(SROCC) obtained for each machine learning model with 
the standard error bounds based on 5-fold cross validation. 
The larger the accuracy value, the better performing the 
model is. 


These results suggest that raw LMNET may be the most ap- 
propriate target label, particularly since the proposed fea- 
tures seem to be more useful when building a model for 
predicting raw LMNET. These results do not contradict the 
literature, both educational and non-educational, as MNET 
has been used as the gold-standard way of quantifying engagement.
Our experiments thus showed that the use of transformations
inspired by subjective assessment does not improve
the predictive power of engagement signals. This may be 
because these transformations/correction methods are ini- 
tially designed to address biases in latent user preferences. 


Although similar biases may exist in learners when consum- 
ing educational materials (e.g. learner fatigue, different en- 
gagement thresholds, language level preferences, etc.) we 
hypothesise that the most influential driver of engagement 
is the information content and style of the video. 


Another observation from Table 2 is that the KRR and KSVR
models outperform their linear versions. This suggests that
there could be non-linearity in the dataset that is better
captured by the kernel techniques. RF appears to be the best
performing model, providing further evidence that
non-linearity plays a significant role.


To show how the accuracy changes with the difference of MNET
between two lectures, we first compute the MNET difference
for every possible pair of lectures, then group the pairs
into bins of width 0.1 from 0 to 1, and finally compute the
pairwise accuracy for each bin. Figure 2 shows how the
performance of the model changes based on the difference of
MNET between lecture pairs. The bars in the figure represent
the pairwise accuracy for all the pairs that belong to the
same bin. For example, the pairs with the largest difference
of MNET are predicted correctly with 0.962 accuracy, whereas
pairs with the smallest difference are predicted with 0.642
accuracy.


Intuitively, a learner might have a similar experience
consuming a pair of video lectures that are similarly
engaging (at least disregarding the topic), as one is less
likely to notice the difference. The black line in Figure 2
presents the cumulative pairwise accuracy of the model if we
assume that learners do not notice the difference in
experience for lecture pairs that have a small difference of
MNET. The plotted cumulative pairwise accuracy (y-axis) is
computed by restricting the comparisons to lecture pairs
with a difference of MNET between the lower bound of the
x-axis value and 1.0. For instance, the cumulative pairwise
accuracy of the model is 0.816 when learners do not notice
the difference between similarly engaging lecture pairs with
an MNET difference in [0.0, 0.2]. This value is the pairwise
accuracy over all the lecture pairs with an MNET difference
in (0.2, 1.0].


Figure 2: Bar chart showing how the pairwise accuracy
changes based on the difference of MNET between lecture
pairs.
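The binning and cumulative computation behind Figure 2 can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the function name is hypothetical, bins are taken as half-open intervals (lower, upper] to match the (0.2, 1.0] example, pairs with identical labels are skipped, and the data are made up.

```python
import math
from itertools import combinations


def binned_pairwise_accuracy(y_true, y_pred, bin_width=0.1, n_bins=10):
    """Return (per_bin, cumulative) pairwise accuracy lists.

    Bin k holds pairs whose absolute label difference lies in
    (k * bin_width, (k + 1) * bin_width]. cumulative[k] is the accuracy over
    all pairs in bins k, k + 1, ..., i.e. pairs with a difference above
    k * bin_width. Empty bins yield None.
    """
    correct = [0] * n_bins
    total = [0] * n_bins
    for i, j in combinations(range(len(y_true)), 2):
        diff = abs(y_true[i] - y_true[j])
        if diff == 0:
            continue  # no true order to recover
        # Rounding guards against floating-point noise at bin edges.
        k = math.ceil(round(diff / bin_width, 9)) - 1
        k = min(max(k, 0), n_bins - 1)
        total[k] += 1
        if (y_true[i] - y_true[j]) * (y_pred[i] - y_pred[j]) > 0:
            correct[k] += 1
    per_bin = [c / t if t else None for c, t in zip(correct, total)]
    cumulative = []
    for k in range(n_bins):
        c, t = sum(correct[k:]), sum(total[k:])
        cumulative.append(c / t if t else None)
    return per_bin, cumulative


# Made-up labels and predictions, for illustration only.
labels = [0.05, 0.12, 0.50, 0.95]
predictions = [0.30, 0.20, 0.40, 0.90]
per_bin, cumulative = binned_pairwise_accuracy(labels, predictions)
print(per_bin[0], cumulative[0], cumulative[1])
```

In this toy example the only misranked pair falls in the first bin, so the cumulative accuracy rises to 1.0 as soon as pairs with a difference of at most 0.1 are excluded, mirroring the upward trend of the black line described above.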


4.2.3 E3: Length of Materials vs. Engagement 
Several studies have shown that features that quantify
material length have a significant impact on sustained
engagement with the material [17, 14]; this is reaffirmed by
our feature importance analysis in Figures 6 and 7. We
investigate how the length of the lectures impacts
engagement prediction (i.e. whether the engagement predictor
is naively distinguishing between long and short video
lectures). We first investigate the distribution of total
word count in the video lectures (Figure 3), which is
directly related to the length. Based on the observed
multi-modal distribution, we form two groups: i) short
lectures of fewer than 5000 words and ii) long lectures (see
the engagement distribution in Figure 4). As anticipated,
the percentage of watch time tends to be lower for long
lectures.


We investigate how median engagement labels are distributed
in the aforementioned groups and how the pairwise accuracy
differs within and between the groups. Figure 5 shows that
the model is better at comparing short-short lecture pairs
than long-long lecture pairs. In the context of the VLN
dataset, this is desirable because there are more short
lectures than long lectures (Figure 3). Recent findings
(e.g. [17]) also encourage authors to make short videos,
increasing the likelihood that future video productions will
be short lectures. The MNET distribution in Figure 4 shows
that long lectures have a more skewed target value
distribution, concentrated closer to 0, than short lectures,
suggesting that learners tend to consume smaller fractions
of long videos. This is likely to be driven by factors
beyond the measured features of the lectures, such as
limited time availability and the short attention span of
learners.


Figure 3: Distribution of word count of video lectures.


Figure 4: Distribution of engagement labels for short and
long lectures.


Table 3: Pairwise accuracy with standard error via 5-fold
cross validation for RF model using content-based features
vs. content-based + video-specific features.

Model                     | Pairwise Accuracy
                          | Subject-agnostic | Subject-specific
Content-based Features    | .724±.014        | .733±.018
+ Video-specific Features | .744±.011        | .755±.014
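The comparison between short-short, long-long and short-long lecture pairs can be sketched as below. Only the 5000-word threshold comes from the text; the function name, the handling of tied labels and the toy data are assumptions made for illustration.

```python
from itertools import combinations

SHORT_WORD_LIMIT = 5000  # lectures with fewer words are treated as short


def accuracy_by_length_group(word_counts, y_true, y_pred):
    """Pairwise accuracy computed separately for each type of lecture pair."""
    counts = {"short-short": [0, 0], "long-long": [0, 0], "short-long": [0, 0]}
    for i, j in combinations(range(len(y_true)), 2):
        if y_true[i] == y_true[j]:
            continue  # tied labels carry no order (assumption of this sketch)
        short_i = word_counts[i] < SHORT_WORD_LIMIT
        short_j = word_counts[j] < SHORT_WORD_LIMIT
        if short_i and short_j:
            group = "short-short"
        elif not short_i and not short_j:
            group = "long-long"
        else:
            group = "short-long"
        counts[group][1] += 1
        if (y_true[i] - y_true[j]) * (y_pred[i] - y_pred[j]) > 0:
            counts[group][0] += 1
    return {g: (c / t if t else None) for g, (c, t) in counts.items()}


# Made-up word counts, labels and predictions, for illustration only.
word_counts = [1200, 3000, 8000, 12000]
labels = [0.6, 0.5, 0.2, 0.1]
predictions = [0.55, 0.50, 0.15, 0.20]
result = accuracy_by_length_group(word_counts, labels, predictions)
print(result)
```

Reporting the three groups separately makes it visible when a predictor is only separating long lectures from short ones: in that case the short-long accuracy would be high while the within-group accuracies stay near chance.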


4.2.4 E4: Video-Features and Subject Areas 

Table 3 shows how the pairwise accuracy increases when 
restricted to subject-specific comparisons (lecture pairs be- 
longing to the same subject area). This is clearly an advan- 
tage, given that most often, an educational recommendation 
system needs to make choices among sets of resources that 
belong to the same subject area. 


Proceedings of The 13th International Conference on Educational Data Mining (EDM 2020) 56 


Figure 5: Accuracy bar chart for different types of
comparisons using short and long lecture labels. (y-axis:
Pairwise Accuracy; x-axis: Lecture Comparisons, with bars
for All, Short-Short Only, Long-Long Only and Short-Long
Only.)


Table 4: Influence of content-based features on engagement
as per their verticals outlined in [10].

Quality Vertical  | Feature                 | MAS  | % MAS
Topic Coverage    | Word Count              | .250 | .366
Freshness         | Published Date          | .107 | .157
Understandability | Easiness                | .052 | .076
Understandability | Stop-word Coverage Rate | .042 | .061
Presentation      | Normalization Rate      | .039 | .058
Topic Coverage    | Title Word Count        | .039 | .057
Presentation      | To Be Rate              | .038 | .055
Topic Coverage    | Document Entropy        | .033 | .048
Understandability | Stop-word Presence Rate | .028 | .041
Presentation      | Conjunction Rate        | .019 | .028
Presentation      | Preposition Rate        | .014 | .020
Presentation      | Pronoun Rate            | .013 | .020
Presentation      | Auxiliary Rate          | .009 | .013


Table 3 additionally shows how the performance differs when
using exclusively the cross-modal set of features and when
adding video-specific features. The addition of
video-specific features increases the pairwise accuracy by
approximately 2 percentage points (from .724 to .744 for
subject-agnostic and from .733 to .755 for subject-specific
comparisons). This result shows that restricting the model
to cross-modal features costs some performance, although in
return the feature extractors can be reused in a practical
scenario.


4.2.5 E5: Feature Importance Analysis

The SHAP value summary plots for the content-based and
video-specific feature sets are presented in Figures 6 and 7
respectively, where the features are ordered by overall
feature influence using the best performing prediction model
(RF). Colour represents the raw feature value (blue low, red
high). For example, when the observed values of a feature
are red and have a negative SHAP value, higher values of
this feature negatively impact the LMNET prediction.
Regarding video length, both figures confirm its impact on
engagement, showing that long videos generally present lower
engagement and vice versa, with lecture duration and word
count being the most relevant features. Prior studies
confirm this observation [17, 40, 15].


Table 4 complements Figure 6 by giving a more quantitative
representation of how the influence of different features
varies across the test dataset. A higher MAS is associated
with a more important feature. Looking at the five most
influential features, we observe that all identified quality
verticals (topic coverage, understandability, freshness and
presentation) are represented. This observation supports the
importance of considering all the different verticals when
predicting context-agnostic engagement. The influence of the
top features is also consistent with results on
quality-biased information search [3], where Title Word
Count is also found to be comparatively less important.
Figures 6 and 7 also show the importance of
modality-specific features in this prediction task, as
Lecture Duration, Silence Period Rate and Speaker Speed rank
highly in Figure 7.


Figure 6: SHAP summary plot for cross-modal features.
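The relationship between the two numeric columns of Table 4 can be illustrated with a short sketch. It assumes that MAS denotes the mean absolute SHAP value of a feature over the test lectures and that % MAS is each feature's share of the total MAS; neither definition is spelled out in this section, so both are assumptions. The Published Date MAS is taken as .107, the value implied by the % MAS column, and the function names are hypothetical.

```python
def mean_absolute_shap(shap_rows):
    """Mean absolute SHAP value per feature.

    shap_rows holds one list of per-feature SHAP values for each test lecture.
    """
    n_rows = len(shap_rows)
    n_features = len(shap_rows[0])
    return [sum(abs(row[f]) for row in shap_rows) / n_rows
            for f in range(n_features)]


def normalise(scores):
    """Express each score as a share of the total."""
    total = sum(scores)
    return [s / total for s in scores]


# Toy SHAP matrix (2 lectures x 2 features), for illustration only.
toy_mas = mean_absolute_shap([[0.2, -0.1], [-0.4, 0.1]])

# MAS and % MAS columns of Table 4, top to bottom.
table4_mas = [.250, .107, .052, .042, .039, .039, .038,
              .033, .028, .019, .014, .013, .009]
table4_pct = [.366, .157, .076, .061, .058, .057, .055,
              .048, .041, .028, .020, .020, .013]

# Largest gap between the recomputed shares and the reported % MAS values.
max_gap = max(abs(a - b) for a, b in zip(normalise(table4_mas), table4_pct))
print(toy_mas, max_gap)
```

Under these assumptions the recomputed shares agree with the reported % MAS column to within roughly .001, which is what one would expect given that the MAS column is itself rounded to three decimals.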


4.2.6 E6: Population-based vs. Personalised 

We use the 20 most active learners from the VLN dataset 
to compare the predictive performance of context-agnostic 
to contextual/personalised models when predicting engage- 
ment. Firstly, we train the population-based prediction model 
using the VLN dataset (outlined in section 3.1) using a 70:30 
train-test split. In order to build the personalised model, 
for each user, we make a similar 70:30 train-test split re- 
specting the temporal order of their individual events. We 
use the training data to build a personalised model per user 
using only the cross-modal set of features (no video-specific 
features). For each learner $\ell$, we make predictions on
the $N_\ell$ test events using (i) the population-based
model and (ii) the personalised model trained on the
personal events of the learner. We calculate the Mean
Absolute Error, $\mathrm{MAE}(\ell)$, as:


$$\mathrm{MAE}(\ell) = \frac{1}{N_\ell} \sum_{n=1}^{N_\ell} \left| Y_n - \hat{Y}_n \right|$$


where $\hat{Y}_n$ is the prediction and $Y_n$ is the
observed engagement for test event $n$. As regression models
are devised for the task, MAE is a sensible evaluation
metric to measure the predictive performance of the models.
We then calculate the difference in $\mathrm{MAE}(\ell)$
between the population-based and the personalised model
(population-based minus personalised). Thus, a negative
value indicates that the population model is better and vice
versa.


Figure 7: SHAP summary plot with video-specific features.


Figure 8, where the y-axis represents the difference in
performance between the population-based and personalised
models, shows that the population-based model has better
predictive power when the number of training examples
available for the individual learner is limited
(approximately 60 or fewer). The green line marks a MAE
difference of 0. This demonstrates the usefulness of the
population-based engagement prediction model in a situation
where the recommender system is in a cold-start phase.


Figure 8: How the difference between the Mean Absolute Error
(MAE) of the population-based and personalised models
changes with the number of training events per learner. Each
data point is an individual learner in the dataset.
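The per-learner comparison in this experiment can be sketched as follows. The sketch covers only the temporal 70:30 split and the MAE difference, not the models themselves; the event fields `time` and `engagement`, the function names and all numbers are hypothetical and chosen purely for illustration.

```python
def temporal_split(events, train_fraction=0.7):
    """Split a learner's events so that all training events precede test events."""
    ordered = sorted(events, key=lambda e: e["time"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between observed and predicted engagement."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mae_difference(y_true, population_pred, personalised_pred):
    """MAE(population) - MAE(personalised); negative favours the population model."""
    return (mean_absolute_error(y_true, population_pred)
            - mean_absolute_error(y_true, personalised_pred))


# Ten made-up events for one learner, deliberately out of order.
events = [{"time": t, "engagement": 0.1 * (t % 5)}
          for t in [3, 0, 9, 5, 1, 7, 2, 8, 4, 6]]
train, test = temporal_split(events)

# Made-up observed engagement and predictions for three test events.
observed = [0.5, 0.2, 0.8]
population_pred = [0.4, 0.3, 0.6]
personalised_pred = [0.1, 0.6, 0.3]
diff = mae_difference(observed, population_pred, personalised_pred)
print(len(train), len(test), diff)
```

With these toy numbers the difference is negative, which under the sign convention above corresponds to a learner for whom the population-based model is the better predictor.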


Table 5: Pairwise accuracy for STEM and Miscellaneous
(Misc.) lectures when trained with subject-agnostic and
subject-specific training data.

Training Data    | Test Data
                 | STEM | Misc.
Subject-agnostic | .737 | .708
Subject-specific | .732 | .704


4.2.7 E7: Individual Models per Knowledge Area 
To understand if training subject-specific models can im- 
prove on the predictive power of the overall task, we parti- 
tion the lecture records into 2 categories: 


• STEM: Life Sciences, Physics, Technology and
Mathematics.


• Miscellaneous: Social Sciences, Humanities, Arts and
Philosophy.


Then, we compare the performance of the models trained on
subject-agnostic (STEM + miscellaneous) and subject-specific
(STEM only or miscellaneous only) training data. Table 5
provides little evidence against the assumption that a
common subject-agnostic engagement model holds across
knowledge areas: whether trained on all knowledge areas or
on each of the two categories separately, the models obtain
very similar test accuracy for each category (.737 vs .732
for STEM and .708 vs .704 for miscellaneous). In fact, the
best performance is obtained in both cases by training with
the whole dataset. This indicates that, in general, a common
engagement model can be assumed across knowledge areas.


4.3 Limitations 

Firstly, the model does not include features that capture
the authority of the content or its authors. Authority has
been identified as an influential feature, and lacking it is
a weakness of this model. However, identifying an authority
indicator that generalises beyond niche communities (e.g.
academia) is challenging yet necessary, especially in the
OER landscape where anyone can author learning materials.
Secondly, the topic coverage features used in this model
(Word Count, Title Word Count and Document Entropy) are
relatively naive, although they are useful. Better features
will likely improve the model. The current work demonstrates
promise in predicting learner engagement with video lectures
using easily automatable material features alone; more
sophisticated features, both cross-modal and
modality-specific, could lead to higher predictive
performance and a better understanding of context-agnostic
engagement. Thirdly, the engagement model is trained on
English lectures and English translations of non-English
lectures, which impacts the generalisation ability of the
model. The same applies to non-video content. More rigorous
testing is needed on these fronts. Lastly, given that our
dataset only considers OERs and excludes the learning
dimension, we highlight that some of our findings may not be
directly applicable to other types of educational material.
In particular, given that most of our features are
language-based and we disregard visual information, the
built models may not generalise to general-purpose videos.


5. CONCLUSIONS


Given the timely need, we set out to develop and empirically
test the suitability of engagement prediction models for
automatically assessing context-agnostic engagement of OERs.
Due to the scarcity of publicly available datasets for the
task, we sourced a new video lectures dataset and evaluated
how different machine learning models perform on it. In our
analysis, we observed that the Random Forest algorithm
performs best. We show that cross-modal features provide
satisfactory performance, which is a major advantage, since
these can be extracted from different resource modalities.
Further experiments show that adding modality-specific
features gives the model a slight boost, although the
performance does not change significantly. Feature analysis
showed that lecture length features are the most influential
in predicting context-agnostic engagement, which agrees with
prior work. Other moderately influential features come from
diverse quality verticals. Our analysis also showed that the
model ranks lecture pairs much more accurately when their
engagement values are very different than when they are
similar. This is natural, and the negative impact of
misranking pairs of similarly engaging lectures is
relatively small. Our experiments demonstrated that the
built model is useful in data scarcity scenarios, e.g. to
address the common cold-start problem in recommender
systems. This holds for both new users and new content: the
model can automatically estimate the engagement for new
material, and it can be used as a prior when we do not have
enough data from a user to build a personalised model.
Finally, we show that dividing the dataset into different
knowledge areas (subjects) and building separate models does
not improve performance, supporting the view that a common
underlying model can be built for estimating engagement
across different knowledge areas.


The proposed context-agnostic engagement prediction model
can be beneficial in improving different components of an
educational recommendation system. In situations where new
content is discovered frequently (e.g. the OER landscape
[27, 7]), the proposed prediction model estimates how
engaging materials are prior to exposing them to the learner
population. This allows a better balance between the risks
relating to learner satisfaction and the opportunities of
having fresh materials. The proposed context-agnostic model
can also be integrated with a personalisation system in
different ways. It can act as a prior that mitigates the
cold-start problem on both the user and content fronts. In
systems where personalisation focuses heavily on the topics
covered in the materials [9], this model can complement the
content-based model by accounting for stylistic and
linguistic features that go beyond topic coverage.


To further improve the models, future work should address
the main limitations discussed. Future versions of our model
should incorporate more sophisticated features. It could be
beneficial to include features capturing authority and topic
coverage [10]. In this sense, Wikification [6] can be used
to extract covered topics, and data-driven authority
features, such as [1], can be used to learn a universal
author authority score. On the cross-modal front, more
features focusing on content understanding, such as topic
coherence and argument strength, can be considered. On the
video-specific front, features such as liveliness of the
presenter, sound quality and narration quality can be
incorporated. Regarding the generalisation capabilities of
the model, evaluating the effectiveness of the cross-modal
feature set with a bigger video lecture dataset [17, 40] and
a text dataset [14] will increase the confidence in the
feature set. Similarly, non-English datasets should also be
taken into account.